Lab 7

library(tidyverse)
library(mvtnorm)
library(plotly)
library(MASS)
library(ggplot2)

Multivariate Normal

1. Simulation of bivariate normal distribution

  (a). First, let’s check the correctness of the formula we derived in class.
N= 10000
mymux = 1
mysigx = .5
mymuy = 2
mysigy = 1.5
myrho = .5  #positive moderate correlation

Z1 = rnorm(N)
Z2 = rnorm(N)

x <- mymux + mysigx*Z1


y <- mymuy + mysigy*(myrho*Z1 + sqrt(1-myrho^2)*Z2)

mean(y) # should be ~ 2
## [1] 1.985657
sd(y) # should be ~ 1.5 
## [1] 1.502036
cor(x,y) # should be ~ .5
## [1] 0.4877044
plot(x,y,pch = 46)#positive moderate correlation

We see that the mean and standard deviation of y and the correlation coefficient are all close to their theoretical values.
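This is exactly what the construction guarantees: since \(Z_1\) and \(Z_2\) are independent standard normals,

\[
\operatorname{Var}(Y)=\sigma_Y^2\bigl(\rho^2+(1-\rho^2)\bigr)=\sigma_Y^2,
\qquad
\operatorname{Cov}(X,Y)=\sigma_X\sigma_Y\rho,
\]

so \(\operatorname{Cor}(X,Y)=\rho\).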

  (b). Write a function that does the simulation using the formula we found in class (checked above).
# Simulate bivariate normal distribution

mybivar = function(size = 1, mu1=0, mu2 = 0, sig1 = 1, sig2 = 1, rho = 0){
  mydf = data.frame(x = rep(NA,size), y = rep(NA,size))
  z1 = rnorm(size)
  z2 = rnorm(size)
  x = sig1*z1 + mu1
  y = sig2*(rho*z1 + sqrt(1-rho^2)*z2) + mu2
  return(data.frame(x = x, y = y))
}
  (c). Try it out and plot the result.
N = 5000
mymux = 1
mymuy = 2
mysigx = .5
mysigy = 1
myrho = -.8 #-negative high correlation

mydf = mybivar(N, mymux, mymuy, mysigx, mysigy, myrho)
head(mydf)
##           x         y
## 1 0.8526927 2.2440769
## 2 0.3472761 3.0290779
## 3 0.6698738 2.5912914
## 4 2.0943417 0.8984809
## 5 0.9016436 2.0544028
## 6 0.8070506 2.7382341
plot(mydf, cex = .1) #-negative high correlation

#or 

plot(mydf, pch = 46)
grid(col = 3) 

Try different values of myrho to see how the shape of the scatter plot changes with the correlation coefficient.

  (d). Let’s try the 3D plot.
dens <- kde2d(mydf$y,mydf$x)

plot_ly(x = dens$x,
        y = dens$y,
        z = dens$z) %>% add_surface() # z is an n1-by-n2 matrix of the estimated density: rows correspond to x values, columns to y values
  (e). 3D scatter plot
# threejs Javascript plot
library(threejs)
# Unpack data from kde grid format
x <- dens$x; y <- dens$y; z <- dens$z
# Construct x,y,z coordinates
xx <- rep(x,times=length(y))
yy <- rep(y,each=length(x))
zz <- z; dim(zz) <- NULL
# Set up color range
ra <- ceiling(16 * zz/max(zz))
col <- rainbow(16, 2/3)
# 3D interactive scatter plot
scatterplot3js(x=xx,y=yy,z=zz,size=0.4,color = col[ra],bg="black")

2. Multivariate normal

Check more on the “mvtnorm” package: https://www.rdocumentation.org/packages/mvtnorm/versions/1.1-3

#install.packages("scatterplot3d") # Install
library("scatterplot3d") # load

library(mvtnorm) # multivariate normal
set.seed(123)
N <- 2000 # number of random samples
# Target parameters for univariate normal distributions 
rho <- -0.9
mu1 <- 1; s1 <- 2 #sigmax
mu2 <- 1; s2 <- 2 #sigmay

# Parameters for bivariate normal distribution
mu <- c(mu1,mu2) # Mean
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2) # Covariance matrix

#or
#sigma = diag(2)

XY = rmvnorm(N, mu, sigma) 
fxy = dmvnorm(XY, mu, sigma)
x = as.matrix(XY[,1])
y = as.matrix(XY[,2])
z = as.matrix(fxy)

scatterplot3d(x, y, z, main='rho=-0.9')
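As a cross-check, the same kind of sample can be generated without mvtnorm by multiplying i.i.d. standard normals by a Cholesky factor of \(\Sigma\). This is a base-R sketch; the variable names are ours, not part of the lab:

```r
set.seed(1)
N <- 5000
mu <- c(1, 1)
rho <- -0.9
sigma <- matrix(c(4, 4*rho, 4*rho, 4), 2)   # s1 = s2 = 2

# If the rows of Z are iid N(0,1) pairs, Z %*% chol(sigma) has covariance sigma
L <- chol(sigma)                    # upper-triangular factor, t(L) %*% L = sigma
Z <- matrix(rnorm(2 * N), ncol = 2)
XY2 <- sweep(Z %*% L, 2, mu, "+")   # shift each column by its mean

colMeans(XY2)   # close to c(1, 1)
cov(XY2)        # close to sigma
```

The Cholesky approach is what rmvnorm does internally (by default it uses an eigendecomposition, but the idea is the same: transform uncorrelated normals into correlated ones).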

3. Conditional expectations from bivariate normal distribution

Make an approximate sample from \(X|Y = 1\).

Since \(Y\) is a continuous variable, we won’t find values where \(Y\) is exactly equal to 1. Therefore, we throw away all cases where \(Y\) is not close to 1; that is, we keep only the samples where \(0.95 < Y < 1.05\).

y0 = 1
keep = mydf$y < y0 + .05 & mydf$y > y0 - .05 # because Y is a continuous r.v.
x.y = mydf$x[keep] #conditional X|Y=1
hist(x.y)

fig <- plot_ly(x = ~x.y, type = "histogram") %>%
  layout(title = 'Distribution of the Conditional Variable X|Y=1')
fig
df <- data.frame(y = x.y)
p <- ggplot(df, aes(sample = y))
p + stat_qq() + stat_qq_line()+
  ggtitle ("Q-Q plot of the Conditional variable")

#qqnorm(x.y)
mean(x.y)
## [1] 1.420377
var(x.y)
## [1] 0.07489792

It is clear that the conditional distribution also follows a normal distribution.

  (a). Theoretical mean and variance:
# theoretical conditional mean and variance, from the formulas derived in class
mymux + myrho*mysigx/mysigy*(y0 - mymuy)
## [1] 1.4
(1-myrho^2)*mysigx^2
## [1] 0.09
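These values agree with the general formula for the conditional distribution of a bivariate normal:

\[
X \mid Y=y \;\sim\; N\!\left(\mu_X+\rho\,\frac{\sigma_X}{\sigma_Y}\,(y-\mu_Y),\;(1-\rho^2)\,\sigma_X^2\right),
\]

which with \(\mu_X=1\), \(\mu_Y=2\), \(\sigma_X=.5\), \(\sigma_Y=1\), \(\rho=-.8\), and \(y=1\) gives mean \(1.4\) and variance \(0.09\), close to the simulated values above.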
  (b). Scatter plot and theoretical regression line

We must plot x against y.

The red line shows the function \(E(X|Y=y)\)

plot(mydf$y,mydf$x, pch = 46) #joint distribution of x and y
abline(a = mymux - myrho*mysigx/mysigy*mymuy, b = myrho*mysigx/mysigy, col = 2, lwd = 3) 

Multinomial distribution

Recall: The multinomial distribution is a generalization of the binomial distribution to k categories instead of just two (success/failure). For n independent trials, each of which results in exactly one of the k categories, the multinomial distribution gives the probability of any particular combination of counts across the categories.

Example: Rolling a die N times
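For instance, the probability that 12 rolls of a fair die produce each face exactly twice can be computed with dmultinom(), or directly from the multinomial pmf \(\frac{n!}{x_1!\cdots x_k!}\,p_1^{x_1}\cdots p_k^{x_k}\):

```r
x <- rep(2, 6)     # two of each face in 12 rolls
p <- rep(1/6, 6)   # fair die

dmultinom(x, prob = p)                            # pmf value, ~0.00344

# same value computed directly from the formula
factorial(12) / prod(factorial(x)) * prod(p^x)    # ~0.00344
```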

1. Multinomial Examples

(a). Chess Game Prediction

Suppose that in a game between two chess players, Player A wins with probability 0.40, Player B wins with probability 0.35, and the game ends in a draw with probability 0.25.

The multinomial distribution can be used to answer questions such as: “If these two chess players played 12 games, what is the probability that Player A would win 7 games, Player B would win 2 games, the remaining 3 games would be drawn?”

# 12 trials, 3 categories (A wins, B wins, draw)
dmultinom(x=c(7,2,3), prob = c(0.4,0.35,0.25))
## [1] 0.02483712

(b). Opinion Polls on Election

In a little town, 40% of the eligible voters prefer candidate A, 10% prefer candidate B, 50% have no preference.

You randomly sample 10 eligible voters. What is the probability that 4 will prefer candidate A, 1 will prefer candidate B, 5 will have no preference ?

dmultinom(x=c(4,1,5), prob = c(0.4,0.1,0.5))
## [1] 0.1008

2. Generate Multinomial Random Variables

Generate multinomial random variables with varying probabilities: given a matrix of multinomial probabilities in which rows correspond to observations and columns to categories (each row summing to 1), generate a matrix with the same number of rows, whose rows hold multinomial cell counts drawn from that row’s own distribution. The code below starts with the simpler case in which every experiment shares a single probability vector.
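A sketch of the row-wise (varying-probability) case: each observation gets its own probability vector, and we call rmultinom() once per row. The matrix my_prob_mat and the apply() loop are our illustration, not part of the lab code:

```r
set.seed(42)
# one probability vector per observation (rows sum to 1)
my_prob_mat <- rbind(c(0.20, 0.30, 0.10, 0.40),
                     c(0.70, 0.10, 0.10, 0.10),
                     c(0.25, 0.25, 0.25, 0.25))
size <- 10   # trials per observation

# draw one multinomial sample per row; t() puts observations back in rows
counts <- t(apply(my_prob_mat, 1, function(p) rmultinom(1, size = size, prob = p)))
counts            # one row of cell counts per observation
rowSums(counts)   # every row sums to `size`
```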

# rmultinom(n, size, prob)
my_prob <- c(0.2,0.3,0.1,0.4)
number_of_experiments <- 20 # number of experiments (columns of the result)

(a). Number of Samples = 10

number_of_samples <- 10 # size: trials per experiment

experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    0    1    3    3    2    3    3    1    1     1     1     1     2     1
## [2,]    2    5    1    2    2    3    0    3    4     3     3     5     3     3
## [3,]    0    1    2    0    2    1    1    0    0     1     2     1     0     0
## [4,]    8    3    4    5    4    3    6    6    5     5     4     3     5     6
##      [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]     1     1     2     1     3     3
## [2,]     3     2     2     4     0     2
## [3,]     0     1     1     3     0     0
## [4,]     6     6     5     2     7     5
df=data.frame(experiments) 
df
##   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
## 1  0  1  3  3  2  3  3  1  1   1   1   1   2   1   1   1   2   1   3   3
## 2  2  5  1  2  2  3  0  3  4   3   3   5   3   3   3   2   2   4   0   2
## 3  0  1  2  0  2  1  1  0  0   1   2   1   0   0   0   1   1   3   0   0
## 4  8  3  4  5  4  3  6  6  5   5   4   3   5   6   6   6   5   2   7   5
colSums(df)
##  X1  X2  X3  X4  X5  X6  X7  X8  X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 
##  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
dff= t(df)
colnames(dff) <- c("x1","x2","x3","x4")
rownames(dff) <- NULL
dff
##       x1 x2 x3 x4
##  [1,]  0  2  0  8
##  [2,]  1  5  1  3
##  [3,]  3  1  2  4
##  [4,]  3  2  0  5
##  [5,]  2  2  2  4
##  [6,]  3  3  1  3
##  [7,]  3  0  1  6
##  [8,]  1  3  0  6
##  [9,]  1  4  0  5
## [10,]  1  3  1  5
## [11,]  1  3  2  4
## [12,]  1  5  1  3
## [13,]  2  3  0  5
## [14,]  1  3  0  6
## [15,]  1  3  0  6
## [16,]  1  2  1  6
## [17,]  2  2  1  5
## [18,]  1  4  3  2
## [19,]  3  0  0  7
## [20,]  3  2  0  5
rowSums(dff)
##  [1] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

(b). Plot the marginals

(dfM = data.frame(experiments)/number_of_samples) # counts converted to proportions
##    X1  X2  X3  X4  X5  X6  X7  X8  X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1 0.0 0.1 0.3 0.3 0.2 0.3 0.3 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.2 0.1 0.3
## 2 0.2 0.5 0.1 0.2 0.2 0.3 0.0 0.3 0.4 0.3 0.3 0.5 0.3 0.3 0.3 0.2 0.2 0.4 0.0
## 3 0.0 0.1 0.2 0.0 0.2 0.1 0.1 0.0 0.0 0.1 0.2 0.1 0.0 0.0 0.0 0.1 0.1 0.3 0.0
## 4 0.8 0.3 0.4 0.5 0.4 0.3 0.6 0.6 0.5 0.5 0.4 0.3 0.5 0.6 0.6 0.6 0.5 0.2 0.7
##   X20
## 1 0.3
## 2 0.2
## 3 0.0
## 4 0.5
dffM= t(dfM)
colnames(dffM) <- c("x1","x2","x3","x4")
rownames(dffM) <- NULL
dffM
##        x1  x2  x3  x4
##  [1,] 0.0 0.2 0.0 0.8
##  [2,] 0.1 0.5 0.1 0.3
##  [3,] 0.3 0.1 0.2 0.4
##  [4,] 0.3 0.2 0.0 0.5
##  [5,] 0.2 0.2 0.2 0.4
##  [6,] 0.3 0.3 0.1 0.3
##  [7,] 0.3 0.0 0.1 0.6
##  [8,] 0.1 0.3 0.0 0.6
##  [9,] 0.1 0.4 0.0 0.5
## [10,] 0.1 0.3 0.1 0.5
## [11,] 0.1 0.3 0.2 0.4
## [12,] 0.1 0.5 0.1 0.3
## [13,] 0.2 0.3 0.0 0.5
## [14,] 0.1 0.3 0.0 0.6
## [15,] 0.1 0.3 0.0 0.6
## [16,] 0.1 0.2 0.1 0.6
## [17,] 0.2 0.2 0.1 0.5
## [18,] 0.1 0.4 0.3 0.2
## [19,] 0.3 0.0 0.0 0.7
## [20,] 0.3 0.2 0.0 0.5
rowSums(dffM)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
par(mfrow=c(2,2))

for(i in 1:4) {
  barplot(dffM[,i], main = colnames(dffM)[i]) # marginal distribution of each category
}
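Each marginal of a multinomial is binomial: component \(i\) counts the successes of a single category, so \(X_i \sim \text{Binomial}(n, p_i)\). With many more experiments than above, the observed proportions and variances settle near their binomial values (a quick check):

```r
set.seed(7)
my_prob <- c(0.2, 0.3, 0.1, 0.4)
big <- rmultinom(n = 10000, size = 10, prob = my_prob)   # 4 x 10000 count matrix

# average proportion in each category across experiments
rowMeans(big) / 10            # close to my_prob

# marginal variance of each count: n*p*(1-p) for a Binomial(10, p)
apply(big, 1, var)            # close to 10 * my_prob * (1 - my_prob)
```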

3. Topic Modeling (LDA)

Example: in data mining, the multinomial distribution underlies topic modeling of text:

Each document has its own distribution over topics. Each topic has its own distribution over the words.

Topic modeling therefore involves two kinds of multinomial distributions: the distribution over words for a particular topic, and the distribution over topics for a particular document.

Latent Dirichlet allocation (LDA) is one of the most common algorithms for topic modeling.

Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”

Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.

LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document.

https://www.tidytextmining.com/topicmodeling.html

library(topicmodels)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
StatCorpus <- Corpus(DirSource("SMCorpus"))
(summary(StatCorpus))
##        Length Class             Mode
## T1.txt 2      PlainTextDocument list
## T2.txt 2      PlainTextDocument list
## T3.txt 2      PlainTextDocument list
## T4.txt 2      PlainTextDocument list
## T5.txt 2      PlainTextDocument list
## T6.txt 2      PlainTextDocument list
MyStopwords<-c("form","have","that","must","very","find","provides")

Stat_dtm <- DocumentTermMatrix(StatCorpus,
                         control = list(
                           #stopwords = TRUE, ## remove standard English stopwords
                           wordLengths = c(4, 10),
                           ## keep only words of length 4 to 10
                           removePunctuation = TRUE,
                           removeNumbers = TRUE,
                           tolower = TRUE,
                           #stemming = TRUE,
                           remove_separators = TRUE,
                           stopwords = MyStopwords
                           #bounds = list(global = c(minTermFreq, maxTermFreq))
                         ))
DTM_mat <- as.matrix(Stat_dtm)
DTM_mat[-1,] # drop the first document (T1.txt)
##         Terms
## Docs     analysis available carry concepts data deep important order scientists
##   T2.txt        1         0     0        0    1    0         0     0          0
##   T3.txt        1         0     0        0    1    0         0     0          0
##   T4.txt        0         0     0        0    1    0         0     0          0
##   T5.txt        0         0     0        0    1    0         0     0          0
##   T6.txt        0         0     0        0    0    0         0     0          0
##         Terms
## Docs     statistics analyze draw from gather review studies collection concerns
##   T2.txt          2       1    1    1      1      1       1          0        0
##   T3.txt          1       0    0    0      0      0       0          1        1
##   T4.txt          0       0    0    0      0      0       0          0        0
##   T5.txt          1       0    0    0      0      0       0          0        0
##   T6.txt          0       0    0    1      0      0       0          0        0
##         Terms
## Docs     discipline access computer focuses learn learning machine programs
##   T2.txt          0      0        0       0     0        0       0        0
##   T3.txt          1      0        0       0     0        0       0        0
##   T4.txt          0      1        1       1     1        1       1        1
##   T5.txt          0      0        0       0     0        1       1        0
##   T6.txt          0      0        0       0     1        1       1        0
##         Terms
## Docs     themselves algorithms amounts massive patterns ability artificial
##   T2.txt          0          0       0       0        0       0          0
##   T3.txt          0          0       0       0        0       0          0
##   T4.txt          1          0       0       0        0       0          0
##   T5.txt          0          1       1       1        1       0          0
##   T6.txt          0          0       0       0        0       1          1
##         Terms
## Docs     being experience explicitly improve programmed systems without
##   T2.txt     0          0          0       0          0       0       0
##   T3.txt     0          0          0       0          0       0       0
##   T4.txt     0          0          0       0          0       0       0
##   T5.txt     0          0          0       0          0       0       0
##   T6.txt     1          1          1       1          1       1       1
x <- DTM_mat[-1,]
ap_lda <- LDA(x , k = 2, control = list(seed = 1234))
#ap_lda

The tidytext package provides this method for extracting the per-topic-per-word probabilities, called (“beta”), from the model.

library(tidytext)
ap_topics <- tidy(ap_lda, matrix = "beta")
#ap_topics

Notice that this has turned the model into a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic. For example, the term “data” has a 0.066 probability of being generated from topic 1, but a 0.172 probability of being generated from topic 2.

We could use dplyr’s top_n() to find the 10 terms that are most common within each topic. As a tidy data frame, this lends itself well to a ggplot2 visualization.

library(ggplot2)
library(dplyr)

# %>% is the pipe operator: it chains functions instead of nesting them
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

ap_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()

Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called (“gamma”), with the matrix = “gamma” argument to tidy().

ap_documents <- tidy(ap_lda, matrix = "gamma")
ap_documents
## # A tibble: 10 × 3
##    document topic   gamma
##    <chr>    <int>   <dbl>
##  1 T2.txt       1 0.0105 
##  2 T3.txt       1 0.0173 
##  3 T4.txt       1 0.305  
##  4 T5.txt       1 0.987  
##  5 T6.txt       1 0.992  
##  6 T2.txt       2 0.989  
##  7 T3.txt       2 0.983  
##  8 T4.txt       2 0.695  
##  9 T5.txt       2 0.0131 
## 10 T6.txt       2 0.00815

Each of these values is an estimated proportion of words from that document that were generated from that topic. For example, the model estimates that about 99% of the words in T2.txt were generated from topic 2.